2020-05-07

Introduction


Work Environment Survey (WES)

  • Survey conducted by BC Stats for employees of the BC Public Service.

  • Measures the health of work environments.

  • 80 multiple-choice questions (5-point scale) and 2 open-ended questions.

  • Conducted in 2013, 2015, 2018, and 2020 across 26 ministries.

Introduction

Open-ended Questions

Question 1

What one thing would you like your organization to focus on to improve your work environment?

Example: "Better health and social benefits should be provided."


Question 2

Have you seen any improvements in your work environment and if so, what are the improvements?

Example: "Now we have more efficient vending machines."



*Note: these are fabricated comments, shown as examples of the data.

Objectives

Overarching goal:
Automatically classify comments into themes and sub-themes using multi-label classification.

Question 1
What one thing would you like your organization to focus on to improve your work environment?

  • Build a model for predicting label(s) for main themes.
  • Build a model for predicting label(s) for sub-themes.
  • Scalability: Identify trends across ministries and over the four specified years.


Question 2
Have you seen any improvements in your work environment and if so, what are the improvements?

  • Identify labels for theme classification and compare with existing labels.
  • Build a model for predicting label(s) for themes.
  • Create visualizations for executives to explore the results.

Existing Solution for Question 1

Last year's Capstone

Getting Familiar with the Data

  • Separate data files for each question and each year.
  • Some comments contain sensitive information.
  • Files are in XLSX (Excel) format.

Question 1
What one thing would you like your organization to focus on to improve your work environment?

  • Labeled data from 2013, 2018, and 2020, adding up to around 32,000 respondents.

Question 2
Have you seen any improvements in your work environment and if so, what are the improvements?

  • Labeled data from 2018, covering around 6,000 respondents.
  • Unlabeled data from 2015 and 2020, representing 9,000 additional comments.

EDA

Question 1

Dataset format
Responses for this question are captured and labeled (theme and sub-theme) by hand:


Comments*                                              CPD  CB  EWC  CB_Improve_benefits  CB_Increase_salary
Better health and social benefits should be provided   0    1   0    1                    0



Theme: CB = Compensation and Benefits

Sub-theme: CB_Improve_benefits = Improve benefits

*Note: this is a fabricated comment, shown as an example of the data.

EDA

Question 1

Labels: 13 themes and 63 sub-themes.

Label cardinality for themes: ~1.4
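
Label cardinality is the average number of labels assigned to each comment. A minimal sketch of the calculation follows, assuming each comment is stored as a row of 0/1 indicators as in the dataset format shown earlier. The rows below are toy values invented for illustration, not survey data, and the function names are hypothetical.

```python
def label_cardinality(rows):
    """Average number of positive labels per comment."""
    if not rows:
        raise ValueError("at least one comment is required")
    return sum(sum(row) for row in rows) / len(rows)


def label_density(rows):
    """Label cardinality divided by the number of available labels."""
    return label_cardinality(rows) / len(rows[0])


# Toy indicator rows for three theme columns (CPD, CB, EWC).
# These values are made up for illustration only.
toy_rows = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]

cardinality = label_cardinality(toy_rows)
density = label_density(toy_rows)
print(f"cardinality={cardinality:.2f}, density={density:.2f}")
```

Density puts the cardinality in context: the same average number of labels per comment is much sparser when there are dozens of possible labels than when there are only a few.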

EDA

Question 1

Label cardinality for sub-themes: ~1.6

EDA

Question 2

Labels for 2018: 6 themes and 16 sub-themes

Label cardinality: ~1.6

Challenges

  • Decide on an appropriate metric for evaluating accuracy (accounting for partial correctness) in a multi-label prediction problem.


  • Low label cardinality, indicating sparsity in the training data: ~2 labels per comment out of ~60 labels.


  • Build a model that outperforms last year's MDS team model (higher label precision and recall) so that BC Stats can deploy it.


  • Class imbalance in the data: skewness in the number of comments per label.
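
To make the metric question concrete, the sketch below compares a strict metric (subset accuracy, which requires every label of a comment to match) with metrics that give credit for partial correctness (Hamming loss and micro-averaged precision and recall). It is an illustration with invented predictions, not the evaluation code used by the project, and all function names are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of comments whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label cells that are predicted wrongly."""
    total = sum(len(t) for t in y_true)
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / total


def micro_precision_recall(y_true, y_pred):
    """Precision and recall pooled over every (comment, label) cell."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for a, b in zip(t, p):
            if a == 1 and b == 1:
                tp += 1
            elif a == 0 and b == 1:
                fp += 1
            elif a == 1 and b == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: 3 comments, 4 possible labels each.
y_true = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]]
y_pred = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 1, 1]]

exact = subset_accuracy(y_true, y_pred)
loss = hamming_loss(y_true, y_pred)
precision, recall = micro_precision_recall(y_true, y_pred)
print(exact, loss, precision, recall)
```

In this toy case only one of three comments is matched exactly, yet most individual labels are right, which is why a partial-credit metric may describe a multi-label model more fairly than exact match alone.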

Techniques

Question 1

Techniques

Question 2

Theme Identification

  • Use unsupervised techniques such as PCA (dimensionality reduction), clustering, and topic modelling



Scalability

  • Descriptive Statistics using Matplotlib, Altair and Plotly
  • Identify trends over the years
  • Identify trends across Ministries
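
The trend analysis above could start from a simple aggregation: the share of comments that mention each theme, grouped by year or by ministry. The sketch below assumes each comment is a record with `year`, `ministry`, and `themes` fields. These field names, the ministry names, and the records themselves are hypothetical and invented for illustration.

```python
from collections import Counter, defaultdict


def theme_share_by_group(records, key):
    """Share of comments mentioning each theme, within each group."""
    totals = Counter()
    counts = defaultdict(Counter)
    for record in records:
        group = record[key]
        totals[group] += 1
        for theme in record["themes"]:
            counts[group][theme] += 1
    return {
        group: {theme: n / totals[group] for theme, n in counts[group].items()}
        for group in totals
    }


# Invented records; real data would come from the labeled survey files.
records = [
    {"year": 2018, "ministry": "Ministry A", "themes": ["CB"]},
    {"year": 2018, "ministry": "Ministry B", "themes": ["CB", "EWC"]},
    {"year": 2020, "ministry": "Ministry A", "themes": ["EWC"]},
    {"year": 2020, "ministry": "Ministry B", "themes": ["CB"]},
]

by_year = theme_share_by_group(records, "year")
by_ministry = theme_share_by_group(records, "ministry")
print(by_year)
print(by_ministry)
```

Using shares rather than raw counts keeps groups comparable when response volumes differ between years or ministries. The resulting dictionaries can then be passed to a plotting library for the descriptive charts.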

Deliverables

  • Data pipeline with documentation for our models


  • Dash app that displays the trends across ministries for both qualitative questions

Source: Sketch of the Dash app, based on the app developed by BC Stats for the Workforce Profiles Report 2018.
Note: This figure is for illustrative purposes only; the final version of the app may differ from the sketch.

Timeline